ABSTRACT


Hardware looping is a feature of some processor instruction sets in which the hardware
repeats the body of a loop automatically, rather than requiring software instructions
that take up cycles (and therefore time) to do so. Loop unrolling is a loop
transformation technique that attempts to improve a program's execution speed at the
expense of its binary size, an approach known as a space-time tradeoff. A
convolutional neural network is implemented with simple loops, with hardware looping, with
loop unrolling, and with both hardware looping and loop unrolling, and a comparison
is made to evaluate the effectiveness of hardware looping and loop unrolling. Hardware
loops alone reduce the cycle count, while the combination of hardware loops
and dot product instructions reduces the clock cycle count further. The CNN is
simulated in Xilinx Vivado 2021.1, targeting the Zynq-7000 FPGA.
1. INTRODUCTION


Deep learning algorithms have seen success in a wide variety of applications, such as
machine translation, image and speech recognition, and self-driving vehicles. Only
recently, however, have these algorithms gained traction in the embedded systems domain.
Most embedded systems are based on low-cost microcontrollers with limited memory capacity
and are therefore commonly seen as incapable of running deep learning algorithms.
Nevertheless, we consider that advances in neural network compression and neural network
architecture, combined with an optimized instruction set architecture, could make
microcontroller-grade processors suitable for specific low-power deep learning applications.
The complexity of deep neural networks is far too high for memory-constrained
microcontrollers, whose memory sizes are specified in kilobytes. Some embedded system
designers work around the problem of limited resources by processing neural networks in
the cloud. However, this solution is limited to areas with Internet access. Cloud
processing also has other disadvantages, such as privacy concerns, security, high
latency, communication power consumption, and reliability. Deep learning algorithms perform
massive arithmetic computations. To accelerate these computations at a reasonable
hardware cost, we can use an instruction set extension comprising two instruction types:
hardware loops and dot product instructions. The main contributions of this paper are as
follows:


• We propose an approach for computing neural network functions that is optimized for
the use of hardware loops and dot product instructions.


• We evaluate the effectiveness of hardware loops and dot product instructions for
performing deep learning functions.


• We implement the Lenet-5 neural network on the Zynq-7000.


There have been various approaches to accelerating deep learning functions. The
approaches can be sorted into two groups. Approaches in the first group attempt to
optimize the size of the neural networks or, in other words, optimize the software. Approaches
in the second group try to optimize the hardware on which neural networks are
running. As the work presented here deals with hardware optimization, we focus on the related
approaches for hardware optimization and only briefly note the progress in software
optimizations. A simple instruction set is proposed to evaluate the effectiveness of hardware
loops and dot product instructions with fully optimized assembly functions for the fully
connected convolutional neural network [1]. A custom instruction set architecture is used for
efficient realization of artificial neural networks and can be parameterized to an arbitrary
fixed-point format [2]. A CNN-specific instruction set architecture is used which deploys the
instruction parameters with high flexibility and embeds parallel computation and data reuse
parameters in the instructions [3]. Instruction extensions and microarchitectural
enhancements are proposed to increase computational density and to limit the pressure on the
shared memory hierarchy in RISC processors [4]. A recurrent neural network is introduced into a
convolutional neural network, and the deep features of the image are learned in parallel
using the convolutional neural network and the recurrent neural network [5]. The
Complex Network Classifier (CNC) framework integrates network embedding
and a convolutional neural network to tackle the problem of network classification. By training
the classifier on synthetic complex network data, the authors showed that CNC can not only
classify networks with high accuracy and robustness but can also extract the features of the
networks automatically [6]. A value prediction method is used to exploit the spatial
correlation of zero-valued activations within the CNN output feature maps, thereby saving
convolution operations [7]. The impact of packet loss on data integrity is reduced by taking
advantage of the deep network's ability to understand neural data and by using a data repair
method based on a convolutional neural network [8]. An instruction set simulation process is
used for a soft-core Reduced Instruction Set Computer (RISC) processor, providing a reliable
simulation platform for creating a customizable instruction set for an Application Specific
Instruction Set Processor (ASIP) [9]. A RISC-V ISA compatible processor is presented, and the
effects of the instruction set on the pipeline/microarchitecture design are analyzed in terms of
instruction encoding, functionality of instructions, instruction types, decoder logic complexity,
data hazard detection, register file organization and access, functioning of the pipeline, effect
of branch instructions, control flow, data memory access, operating modes, and execution unit
hardware resources [10].


Deep learning algorithms are used increasingly in smart applications. Some of them also run in
Internet of Things (IoT) devices. IoT Analytics reports that, by 2025, the number of IoT devices
will rise to 22 billion. The motivation for our work stems from the fact that the rise of the IoT
will increase the need for low-cost devices built around a single microcontroller capable of

https://iaeme.com/Home/journal/IJECET editor@iaeme.com


Bhargav Bhat and Abhay A Deshpande 


supporting deep learning algorithms. Accelerating deep learning inference in constrained
embedded devices, as presented in this paper, is our attempt in this direction.


The remainder of this paper is organized as follows. Section 2 presents the background on
convolutional neural networks, Lenet, and the MNIST dataset. Section 3 describes the Lenet-5
architecture. Section 4 presents the results and the subsequent discussion. Section 5 presents
the conclusion and plans for future work.


2. BACKGROUND 
2.1. Convolutional Neural Networks 


A convolutional neural network (CNN, or ConvNet) is a class of artificial neural
network most commonly applied to analyzing visual imagery. A CNN is a deep learning
algorithm that can take in an input image, assign importance to various
aspects/objects in the image, and differentiate one from the
other. The pre-processing required by a CNN is much lower than that of other
classification algorithms. While in primitive methods filters are hand-engineered, with
enough training CNNs can learn these filters/characteristics. CNNs have many
applications, including facial recognition, document analysis, historic and environmental
collections, climate understanding, grey areas (to give a holistic view of what a human sees),
advertising, and other fields.


2.2. Lenet 


Lenet (also called Lenet-5) is a classic convolutional neural network that uses
convolutions, pooling, and fully connected layers. Lenet is used for handwritten digit
recognition with the MNIST dataset.


2.3. MNIST Dataset


MNIST is the acronym for the Modified National Institute of Standards and Technology database.
The MNIST database contains 60,000 training images and 10,000 testing images. MNIST
images have dimensions of 28 x 28. To meet the 32 x 32 dimensions required by the input
layer, the 28 x 28 images are padded. Some of the test images from the MNIST test dataset
are shown in Fig 1.
Figure 1 Some test images from the MNIST dataset
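
The padding step can be sketched as follows. This is a minimal illustration, not the
implementation used in this work: it assumes symmetric zero padding of 2 pixels on every side,
which the text does not specify, and the helper name pad_image is hypothetical.

```python
def pad_image(image, pad=2, fill=0):
    """Surround a 2-D image (a list of rows) with `pad` pixels of `fill`.

    Assumption: symmetric zero padding; the paper only states that the
    28 x 28 images are padded to fit the input layer.
    """
    padded_width = len(image[0]) + 2 * pad
    top = [[fill] * padded_width for _ in range(pad)]
    body = [[fill] * pad + list(row) + [fill] * pad for row in image]
    bottom = [[fill] * padded_width for _ in range(pad)]
    return top + body + bottom


# Stand-in for a 28 x 28 MNIST digit (all ones, for illustration only).
mnist_like = [[1] * 28 for _ in range(28)]
padded = pad_image(mnist_like)
print(len(padded), len(padded[0]))  # 32 32
```
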
2.4. Building Blocks of Convolutional Neural Network


Convolutional Layer: The convolutional layer is the core building block of a CNN. In a
CNN, the input is a tensor with a shape: (number of inputs) x (input height) x (input
width) x (input channels). After passing through a convolutional layer, the image becomes
abstracted to a feature map, also called an activation map, with shape: (number
of inputs) x (feature map height) x (feature map width) x (feature map
channels). Convolutional layers convolve the input and pass the result to the
following layer.
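
The operation a convolutional layer performs can be sketched as below. This is a simplified
illustration under stated assumptions: one input channel, one filter, no padding, no bias, and
no activation. As is common in CNN implementations, the kernel is not flipped (strictly, this
is cross-correlation). The function name conv2d_valid is hypothetical.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide a square `kernel` over `image` (both lists of rows), no padding."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Multiply-accumulate over the k x k window: a dot product.
            for di in range(k):
                for dj in range(k):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


image = [[1.0] * 32 for _ in range(32)]   # 32 x 32 input of ones
kernel = [[1.0] * 5 for _ in range(5)]    # 5 x 5 filter of ones
feature_map = conv2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]), feature_map[0][0])  # 28 28 25.0
```

The innermost two loops form the multiply-accumulate pattern that hardware loops and dot
product instructions are meant to speed up.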


Pooling Layer: Convolutional networks may include local and/or global pooling
layers alongside traditional convolutional layers. Pooling layers reduce the
dimensions of the data by combining the outputs of neuron clusters at one layer
into a single neuron in the following layer. Local pooling combines small clusters;
tiling sizes such as 2 x 2 are commonly used. Global pooling acts on all the neurons of the
feature map. There are two common types of pooling in popular use: max and average.
Max pooling uses the maximum value of each local cluster of neurons in the feature
map, while average pooling takes the average value.
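
Both pooling types can be sketched with one small function. This is an illustrative sketch
only; the function name pool2d and the 4 x 4 example values are hypothetical.

```python
def pool2d(fmap, size=2, stride=2, mode="avg"):
    """Local pooling over size x size windows of a 2-D feature map."""
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            window = [fmap[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out


example = [[1, 2, 5, 6],
           [3, 4, 7, 8],
           [0, 0, 1, 1],
           [0, 4, 1, 1]]
print(pool2d(example, mode="max"))  # [[4, 8], [4, 1]]
print(pool2d(example, mode="avg"))  # [[2.5, 6.5], [1.0, 1.0]]
```
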


Fully Connected Layers: Fully connected layers connect every neuron in one layer to
every neuron in another layer. This is equivalent to a conventional multi-layer perceptron
(MLP) neural network. The flattened feature map goes through a fully connected layer to
classify the images.
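
A fully connected layer reduces to one dot product per output neuron, which makes it a
natural place to illustrate loop unrolling. The sketch below is an assumption-laden
illustration, not the assembly used in this work: the unroll factor of 4, the tanh activation
placement, and all function names are hypothetical choices. In Python the unrolled version
shows only the shape of the transformation; the cycle savings arise on real hardware, where
fewer loop tests and branches are executed.

```python
import math


def dot(weights, inputs):
    """Plain loop: one multiply-accumulate per loop iteration."""
    acc = 0.0
    for i in range(len(weights)):
        acc += weights[i] * inputs[i]
    return acc


def dot_unrolled4(weights, inputs):
    """Same dot product with the loop body unrolled by a factor of 4."""
    n = len(weights)
    acc = 0.0
    i = 0
    while i + 4 <= n:  # four multiply-accumulates per loop test
        acc += weights[i] * inputs[i]
        acc += weights[i + 1] * inputs[i + 1]
        acc += weights[i + 2] * inputs[i + 2]
        acc += weights[i + 3] * inputs[i + 3]
        i += 4
    while i < n:  # leftover elements when n is not a multiple of 4
        acc += weights[i] * inputs[i]
        i += 1
    return acc


def fully_connected(inputs, weight_rows, biases, dot_fn=dot):
    """One output per weight row: tanh(dot(row, inputs) + bias)."""
    return [math.tanh(dot_fn(row, inputs) + b)
            for row, b in zip(weight_rows, biases)]


inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
weight_rows = [[1.0] * 6, [0.5] * 6]
biases = [0.0, -10.5]
print(dot(weight_rows[0], inputs), dot_unrolled4(weight_rows[0], inputs))  # 21.0 21.0
print(fully_connected(inputs, weight_rows, biases)
      == fully_connected(inputs, weight_rows, biases, dot_fn=dot_unrolled4))  # True
```
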


Receptive field: In neural networks, each neuron receives input from some number of
locations in the previous layer. In a convolutional layer, each neuron receives input from
only a restricted area of the previous layer called the neuron's receptive field. Typically
the area is a square (e.g. 5 by 5 neurons), whereas in a fully connected layer the
receptive field is the entire previous layer. Accordingly, in each convolutional layer, each
neuron takes input from a larger region of the input than in previous layers. This is
due to applying the convolution repeatedly, which takes into account the
value of a pixel as well as its surrounding pixels. When using dilated layers, the
number of pixels in the receptive field remains constant, but the field is more sparsely
populated as its dimensions grow when combining the effect of several layers.
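
The growth of the receptive field with depth can be worked out with the standard recurrence
r_out = r_in + (k - 1) * j, where k is the kernel size and j is the product of the strides of
all earlier layers. The sketch below applies it to an assumed stack of 5 x 5 stride-1
convolutions and 2 x 2 stride-2 pooling layers; the stack and the function name are
illustrative assumptions, not taken from the implementation.

```python
def receptive_fields(layers):
    """layers: list of (name, kernel, stride). Returns [(name, receptive_field)]."""
    rf, jump = 1, 1  # a single input pixel sees itself; spacing of 1 pixel
    result = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra tap reaches `jump` pixels further
        jump *= stride             # later layers step over more input pixels
        result.append((name, rf))
    return result


# Assumed stack: conv 5x5 / pool 2x2 / conv 5x5 / pool 2x2 / conv 5x5.
stack = [("conv 5x5", 5, 1), ("pool 2x2", 2, 2),
         ("conv 5x5", 5, 1), ("pool 2x2", 2, 2),
         ("conv 5x5", 5, 1)]
for name, rf in receptive_fields(stack):
    print(name, rf)
# Receptive fields in order: 5, 6, 14, 16, 32
```

Under these assumptions, a neuron in the last convolution sees a 32 x 32 region, that is, the
whole padded input image.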


Weights: Every neuron in a neural network computes an output value by applying a
particular function to the input values received from the receptive field in the previous
layer. The function that is applied to the input values is determined by a vector of weights
and a bias (typically real numbers). Learning consists of iteratively adjusting these biases
and weights.


The vector of weights and the bias are called filters and represent specific features of 
the input (e.g., a specific shape). A distinctive feature of CNNs is that many neurons 
can share the same filter. This decreases the memory footprint because a single bias and 
a single vector of weights are utilized across all receptive fields that share that filter, as 
opposed to each receptive field having its own bias and vector of weights.
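
The memory saving from filter sharing can be quantified with a short calculation. The
numbers below assume a layer with 6 filters of size 5 x 5 on a single-channel input producing
a 28 x 28 feature map; the function names are hypothetical.

```python
def shared_params(filters, k, in_channels):
    """One weight vector (k*k*in_channels) plus one bias per filter."""
    return filters * (k * k * in_channels + 1)


def unshared_params(out_h, out_w, filters, k, in_channels):
    """Hypothetical layer where every output neuron has its own weights and bias."""
    return out_h * out_w * filters * (k * k * in_channels + 1)


print(shared_params(6, 5, 1))            # 156
print(unshared_params(28, 28, 6, 5, 1))  # 122304
```

With sharing, the layer stores 156 values instead of 122,304, a reduction by a factor of
784 (the number of receptive field positions, 28 x 28).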


3. LENET ARCHITECTURE 


Lenet is a pre-trained convolutional neural network model used for recognizing handwritten
and machine-printed characters. The network has 5 layers with learnable parameters,
hence the name Lenet-5. It has three sets of convolution layers combined with average
pooling. After the convolution and average pooling layers, there are two fully connected layers.
Finally, a Softmax classifier assigns the images to their respective classes. The Lenet
layers are depicted in Fig 2.
Figure 2 Architecture of Lenet-5 Model


The following table describes the architecture in more detail.


Table 1 


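Layer           | #filters/neurons | Filter size | Stride | Size of feature map | Activation function
Input           | -                | -           | -      | 32 x 32 x 1         | -
Conv1           | 6                | 5 x 5       | 1      | 28 x 28 x 6         | tanh
AvgPooling1     | -                | 2 x 2       | 2      | 14 x 14 x 6         | -
Conv2           | 16               | 5 x 5       | 1      | 10 x 10 x 16        | tanh
AvgPooling2     | -                | 2 x 2       | 2      | 5 x 5 x 16          | -
Conv3           | 120              | 5 x 5       | 1      | 120                 | tanh
FullyConnected1 | -                | -           | -      | 84                  | tanh
FullyConnected2 | -                | -           | -      | 10                  | Softmax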





The first layer is the input layer with feature map size 32 x 32 x 1.


Next comes the first convolution layer with 6 filters of size 5 x 5 and
stride 1. The activation function used at this layer is the hyperbolic tangent (tanh). The
output feature map is 28 x 28 x 6.

Then comes an average pooling layer with filter size 2 x 2 and stride 2. The resulting
feature map is 14 x 14 x 6, since the pooling layer does not affect the number of
channels.

After this comes the second convolution layer with 16 filters of size 5 x 5 and stride 1. The
activation function is again tanh. The output size is now 10 x 10 x 16.

Next comes the second average pooling layer of 2 x 2 with stride 2. As a result, the size of
the feature map is reduced to 5 x 5 x 16.

The final convolution layer has 120 filters of size 5 x 5 with stride 1 and activation function
tanh. The output size is now 120.

The next layer is a fully connected layer with 84 neurons, which produces 84 output values;
the activation function used here is again tanh.

The last layer is the output layer with 10 neurons and the Softmax function. The Softmax
gives the probability that a data point belongs to a specific class. The class with the
highest value is then predicted.

This is the entire architecture of the Lenet-5 model. The number of trainable parameters of
this architecture is around 60,000.
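
The feature map sizes and the parameter count can be checked with a short calculation
using the output-size formula (n - k) / s + 1. The sketch below assumes stride 2 for both
pooling layers and assumes that every filter of the second convolution layer is connected to
all 6 input channels; the text states only that the total is around 60,000. The layer list and
function names are illustrative.

```python
def out_size(n, k, s):
    """Output width/height of an unpadded k x k window moved with stride s."""
    return (n - k) // s + 1


# (name, kind, filters or neurons, kernel, stride)
LAYERS = [
    ("Conv1", "conv", 6, 5, 1),
    ("AvgPooling1", "pool", None, 2, 2),
    ("Conv2", "conv", 16, 5, 1),
    ("AvgPooling2", "pool", None, 2, 2),
    ("Conv3", "conv", 120, 5, 1),
    ("FullyConnected1", "fc", 84, None, None),
    ("FullyConnected2", "fc", 10, None, None),
]


def trace(layers, size=32, channels=1):
    """Return ([(name, (h, w, c), params)], total_params)."""
    rows, total = [], 0
    for name, kind, units, k, s in layers:
        if kind == "conv":
            params = units * (k * k * channels + 1)  # weights + one bias per filter
            size, channels = out_size(size, k, s), units
        elif kind == "pool":
            params = 0                               # average pooling has no weights here
            size = out_size(size, k, s)
        else:
            params = units * (size * size * channels + 1)
            size, channels = 1, units
        total += params
        rows.append((name, (size, size, channels), params))
    return rows, total


rows, total = trace(LAYERS)
for name, shape, params in rows:
    print(name, shape, params)
print("total", total)  # total 61706
```

Under these assumptions the total is 61,706 trainable parameters, consistent with the figure
of around 60,000 quoted above.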


4. RESULTS 

The following table gives the hardware resources occupied by our design on the Zynq-7000
board. It shows that the design occupies 47% of LUTs, 19% of LUTRAMs, 28% of FFs, 59% of
BRAMs, 54% of DSPs, and 3% of BUFGs. In addition, the implemented design can be displayed
to give an idea of how the design has been distributed, placed, and routed on the selected
Zynq-7000 board.


Table 2 


Resource | Utilization | Available | Utilization %
LUT      | 25456       | 53200     | 47.85
LUTRAM   | 3478        | 17400     | 19.99


4.1. Software Model (MATLAB) 


The software model runs through the whole Lenet CNN and gives its prediction. Fig 3
depicts the result of the software model in MATLAB.
Figure 3 MATLAB Result


4.2. Hardware Model 


The hardware model produces 10 outputs, one score for each of the 10 digits 0,
1, 2, 3, 4, 5, 6, 7, 8, and 9. The highest score should correspond to the correct answer, and
for all tested images the highest score does correspond to the correct value. Fig 4 shows the
simulation result from Xilinx Vivado. In the example below, where the digit 0 has been
taken, the highest score of 5 indicates that 0 is the predicted number.
Figure 4 Behavioral simulation on Xilinx Vivado
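
Selecting the predicted digit from the 10 scores amounts to an argmax, as sketched below.
Only the top score of 5 for digit 0 comes from the example above; the other nine score
values are made up for illustration, and the function names are hypothetical. Because Softmax
preserves the ordering of the scores, taking the argmax before or after Softmax gives the same
digit.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Index (digit) of the highest score."""
    return max(range(len(scores)), key=lambda digit: scores[digit])


# Hypothetical scores for digits 0..9; only the 5 for digit 0 follows the text.
scores = [5, -3, 0, -1, -4, -2, 1, -6, 0, -2]
print(predict(scores))           # 0
print(predict(softmax(scores)))  # 0
```
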


4.3. Implementation 


Once implementation completes, an implementation summary that condenses all the
implementation reports is provided. Fig 5 depicts the implemented design.





Figure 5 Implemented Design of Lenet-5 


5. CONCLUSION 


In this paper, we presented a Lenet-5 convolutional neural network optimized
using hardware looping and dot product units, which provided high accuracy when recognizing
handwritten digits from the MNIST dataset. The hardware loops alone contribute a 24%
decrease in cycle count, while the dot products reduce the cycle count by 27%. As embedded
systems are highly price-sensitive, this is an important consideration. Reducing the size of
neural networks is an essential step in expanding the possibilities for neural networks in
embedded systems.


An interesting topic for further research is Posit, an alternative floating-point number
format that may offer additional advantages, as it has an increased dynamic range at the same
word size. Because of the improved dynamic range, weights could be stored in lower precision,
thus, again, decreasing the memory requirements. Combining the reduced size requirements
with low-cost ISA improvements could make neural networks more ubiquitous in the
price-sensitive embedded systems market.